Introduction

The purpose of this course is to teach you the basics of the R language and give you the confidence to tackle larger projects using the language. Importantly, we want to get you thinking like a programmer. This doesn’t mean that by the end of the course you will know R fully, but you will know enough so you can go online and look for the help you need to complete most tasks.

Practice makes perfect

Programming is like any skill, the more you practice the better you get. It’s really important that you keep using what you have learned after the course is completed otherwise there is a good chance you will forget everything and you’ll be back to square one.

Why Use R?

R is a high-level programming language with a focus on mathematics and statistics, but R can be used for a wide variety of applications given the flexibility of the language. R is also free, and available for all operating systems. Given the richness of the language and no cost to use it, bioinformaticians have used R for more than 20 years as the platform for which which to develop packages to solve bioinformatics problems.

The BioConductor Project is a repository for bioinformatics tools which continues to grow, and hosts packages such as DESeq2 which you may have heard of. Some other popular packages such as Seurat aren’t actually hosted by Bioconductor, but in the main R package repository. We’ll cover package installation later later.

Which other languages do bioinformaticians use?

The Python language has been rocketing in popularity for the past few years, particularly among data scientists who make use of the AI/ML tools such as Tensorflow and PyTorch. Scanpy is a very popular package for single-cell analysis. For very computationally intensive tasks (e.g sequence alignment), languages such as C/C++/Rust are more commonly used, but these are far more difficult to learn.

How will this course work?

We’re going to take a different approach to this course, so it’s somewhat experimental at this point. You will be taught the basics of the R language doing small exercises along the way. However, we will finish by you undertaking a project which will push you quite hard. The aim is that by tacking a more difficult problem will consolidate what you have learnt, and learn more by having to look up solutions to the problems you will likely face.

Getting R and RStudio

Point your browser to http://cran.r-project.org/ to download and install the latest version of R. For these tutorials we are also going to use RStudio which is an advanced development environment for R which includes a window for an editor, console, and plotting window. You will see what this means later.

RStudio

Open up RStudio, and it will look something like this:

The different parts are:

  1. The code editor. This is where you write code.
  2. The R console. This is the R environment where R code is executed.
  3. Workspace. The objects you create along the way will be listed here.
  4. Plots and files. Plots will render here, and files can be browsed in the “Files” pane.

Before we start, we need to do a little prep.

  1. On your computer, make a folder called “Rcourse1”.

We then set the working directory to this folder, so

  1. In RStudio go to Session > Set Working Directory > Choose Directory and find the “Rcourse1” folder and select it.

RStudio will now be looking for files in this folder, and any saved plots will be put here unless stated otherwise.

Now, go to File > New File > R Script

A new empty script will open up in the top left window. Go to File > Save and give it a name. It will be saved to you current working directory. You should see your file being added to the list in the Plot and Files pane.

Now that we’ve done our prep, let do some R.

The Basics

We’ll now look at some basic operations. The code should be copied into your R script as we go along.

Assigning a variable.

Into your script copy/type the following line:

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

This will make a vector of values from 1 to 10, and put them into a variable called x.

Execute the code by hitting the “Run” button at the top-right of the script window. You will see this line appear in the R console below.

To view the contents of the object you have just created, just type x in the console and hit return:

x
##  [1]  1  2  3  4  5  6  7  8  9 10

The contents of x are now printed out.

Now is a good time to learn about commenting and documenting code. This is free text you put into your scripts that tell the reader whats going on. It also helps remind your future self of what you did or your thought process. Comments are put in using #, so for example:

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) # This is a comment.

Anything after a # will be ignored by R. You can run the code again to check.

Rather than typing in the value 1 to 10, there is a much simpler way to create the same vector using :

x <- 1:10
x
##  [1]  1  2  3  4  5  6  7  8  9 10

Much better! Using a colon will always do increments of 1, and it’s also bidirectional:

y <- 5:-5
y
##  [1]  5  4  3  2  1  0 -1 -2 -3 -4 -5

Another way of creating a sequence of numbers is to use the seq function. To learn how this function works, issue the command help(seq). In R you can get a manual for any function using the help() command (or e.g. ?seq). To generate a vector of numbers from 1 to 100 in steps of 10 we need:

a <- seq(0, 100, by = 10)
a
##  [1]   0  10  20  30  40  50  60  70  80  90 100

Exercise: Generate a vector called ‘b’ ranging from 3 to 987 where the length of the vector is 53 entries long. Done? Check the length of the vector you have just made by issuing length(b).

a <- seq(3, 978, length = 53)
a

A note about assigning things to variable names. I use <-, but you can also use = too. So

g <- 1:5
g
## [1] 1 2 3 4 5

Is the same as:

g = 1:5
g
## [1] 1 2 3 4 5

We can also make a new vector d using a vector c:

c <- 1:50 #make a vector c ranging from 1:50
d <- 1/c  #make a vector d by dividing 1 by c
d
##  [1] 1.00000000 0.50000000 0.33333333 0.25000000 0.20000000 0.16666667
##  [7] 0.14285714 0.12500000 0.11111111 0.10000000 0.09090909 0.08333333
## [13] 0.07692308 0.07142857 0.06666667 0.06250000 0.05882353 0.05555556
## [19] 0.05263158 0.05000000 0.04761905 0.04545455 0.04347826 0.04166667
## [25] 0.04000000 0.03846154 0.03703704 0.03571429 0.03448276 0.03333333
## [31] 0.03225806 0.03125000 0.03030303 0.02941176 0.02857143 0.02777778
## [37] 0.02702703 0.02631579 0.02564103 0.02500000 0.02439024 0.02380952
## [43] 0.02325581 0.02272727 0.02222222 0.02173913 0.02127660 0.02083333
## [49] 0.02040816 0.02000000

And do maths on them:

mean(d) # calculate the mean of the vector
## [1] 0.08998411
sd(d) # the standard deviation
## [1] 0.1578087

Conditionals

This is important stuff. If we want to ask whether something is equal to something else, we need to use the == operator, NOT =. Try this:

x == 5
##  [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE

you will see that all are marked FALSE apart from the 5th which is TRUE. If you want to know specifically which element equals 5, use which:

which(x == 5)
## [1] 5

And you’ll see it now says the 5th element of vector x equals 5. A few more:

which(x < 5)
which(x <= 5)
which(x >= 5)
which(x > 5)
which(x != 5)

Lets make another vector y:

y <- 7:10

Finding the intersect is easy:

intersect(x,y)
## [1]  7  8  9 10

What do you think this does?:

x[!x %in% y]

Basic plotting

Lets use this vector d and plot them:

plot(d)

Note the way the axes are labelled in the plot function.

Exercise: Call help(plot) in the console and read about the other plot options available. Produce the same plot as above, but this time as a line which is coloured red. Also, label the axes \(c\) and \(d\) and give the plot a title.

plot(d, ty = "l", col = "red", xlab = "c", ylab = "d", main = "My first plot")

This plot was done using base R. This means we used the plotting functions native, or built-in to the R language. Plotting has been vamped up a lot in R since ggplot was introduced some time ago, the aim here to make plotting more customisable. Lets try the same plot but using ggplot. First, install it:

install.packages(ggplot2) #you only have to do this once.

Now call the ggplot library:

library(ggplot2)

The first thing we need to do is make a data.frame of the data by doing:

df <- data.frame(c = c, d = d)

This makes a table. You can see the first few lines by doing:

df[1:5,]
##   c         d
## 1 1 1.0000000
## 2 2 0.5000000
## 3 3 0.3333333
## 4 4 0.2500000
## 5 5 0.2000000

Lets make the plot:

ggplot(data = df, aes(x = c, y = d)) + geom_line()

We can even add the points:

ggplot(data = df, aes(x = c, y = d)) + geom_line() + geom_point()

and change the colour of each:

ggplot(data = df, aes(x = c, y = d)) + geom_line(color = "red") + geom_point(color = "green")

So ggplot is all about layering information onto plots, and this makes it very customisable. It definitely has a steeper learning curve, but worth it if you produce lots of plots on a day-to-day. If you’re struggling with the idea, use this analogy of ggplots as a cake (Tanya Shapiro (@tanya_shapiro), Twitter):

Matrices

Matrices are the most common data format bioinformaticians work with (microarray/RNAseq data for example). Lets make one:

m <- matrix(0, ncol = 5, nrow = 10)
m
##       [,1] [,2] [,3] [,4] [,5]
##  [1,]    0    0    0    0    0
##  [2,]    0    0    0    0    0
##  [3,]    0    0    0    0    0
##  [4,]    0    0    0    0    0
##  [5,]    0    0    0    0    0
##  [6,]    0    0    0    0    0
##  [7,]    0    0    0    0    0
##  [8,]    0    0    0    0    0
##  [9,]    0    0    0    0    0
## [10,]    0    0    0    0    0

This will create a matrix filled with zeros. To transpose (flip) the matrix we use t() (this will be important later!)

tposed.m <- t(m)
tposed.m
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    0    0    0    0    0    0    0    0    0     0
## [2,]    0    0    0    0    0    0    0    0    0     0
## [3,]    0    0    0    0    0    0    0    0    0     0
## [4,]    0    0    0    0    0    0    0    0    0     0
## [5,]    0    0    0    0    0    0    0    0    0     0

We can also add names to the row and columns:

rownames(m) <- LETTERS[1:10]
colnames(m) <- c("cat", "dog", "pig", "cow", "chicken")
m
##   cat dog pig cow chicken
## A   0   0   0   0       0
## B   0   0   0   0       0
## C   0   0   0   0       0
## D   0   0   0   0       0
## E   0   0   0   0       0
## F   0   0   0   0       0
## G   0   0   0   0       0
## H   0   0   0   0       0
## I   0   0   0   0       0
## J   0   0   0   0       0

Subsetting

Lets make a matrix (and a vector) containing integer values so we can take a look at how subsetting work in R:

v <- 1:10
m <- t(matrix(1:50, ncol = 10, nrow = 5))
rownames(m) <- LETTERS[1:10]
colnames(m) <- c("cat", "dog", "pig", "cow", "chicken")
m
##   cat dog pig cow chicken
## A   1   2   3   4       5
## B   6   7   8   9      10
## C  11  12  13  14      15
## D  16  17  18  19      20
## E  21  22  23  24      25
## F  26  27  28  29      30
## G  31  32  33  34      35
## H  36  37  38  39      40
## I  41  42  43  44      45
## J  46  47  48  49      50

We can access individual elements using square brackets []. Here are some examples:

v[c(7,1,5)] #access elements 7 1 and 5 of the vector
## [1] 7 1 5
m[1,] # access the first row of the matrix
##     cat     dog     pig     cow chicken 
##       1       2       3       4       5
m[,3] # the 3rd column
##  A  B  C  D  E  F  G  H  I  J 
##  3  8 13 18 23 28 33 38 43 48
m[8,2] # the value in the 8th row and 2nd column
## [1] 37
m[3:7,4] # the 3rd to 7th row of the 4th column
##  C  D  E  F  G 
## 14 19 24 29 34

Note the c(7,1,5) where we subset vector v. c means combine and it allows element in the () to be collected into a single vector.

We can also address elements using the column and row names:

m["B",] # gets the row labelled B
##     cat     dog     pig     cow chicken 
##       6       7       8       9      10
m["B", "cow"]
## [1] 9
m[c("F", "J"),c("chicken", "cat", "pig")]
##   chicken cat pig
## F      30  26  28
## J      50  46  48

We often need to collect vectors and assemble them into a matrix. This can be done using the rbind (row) and cbind (column) functions:

v1 <- 1:10
v2 <- 101:110
rbound.mat <-rbind(v1, v2)
cbound.mat <- cbind(v1, v2)
rbound.mat
##    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## v1    1    2    3    4    5    6    7    8    9    10
## v2  101  102  103  104  105  106  107  108  109   110
cbound.mat
##       v1  v2
##  [1,]  1 101
##  [2,]  2 102
##  [3,]  3 103
##  [4,]  4 104
##  [5,]  5 105
##  [6,]  6 106
##  [7,]  7 107
##  [8,]  8 108
##  [9,]  9 109
## [10,] 10 110

We will do a lot more work with matrices later, particularly mathematical operations.

Lists

So far we have talked about vectors and matrices. Often we want to collect these things and put them into one object under a single variable. For example:

alpha <- LETTERS[1:8]
mat <- matrix(rnorm(40), nrow = 8)
listex1  <- list(char = alpha,nums = mat)

You can see that each item is given a name ‘char’ ‘numb’ before it is put into the list. Each element can now be accessed via $:

listex1$char
## [1] "A" "B" "C" "D" "E" "F" "G" "H"
listex1$nums
##            [,1]       [,2]       [,3]       [,4]       [,5]
## [1,] -0.2517120  1.9803582  1.0243503 -0.4516905 -1.2550951
## [2,]  0.2051176 -0.3214994 -0.5915234  0.6890398 -0.4921955
## [3,]  1.5308028 -1.1330523 -0.1604744  0.2013822  0.9822019
## [4,]  0.4186588 -1.0168480  0.4245984 -0.2813000 -0.4502648
## [5,]  0.8680424  1.0805173  0.4579380  0.3286511 -2.5338336
## [6,] -0.3212571 -0.5240946 -1.5649648 -1.5379132 -0.8466138
## [7,]  0.4246526 -0.4979848 -1.2103770 -0.6464327 -1.5322684
## [8,]  2.3768041 -0.5477924 -1.5343441 -0.5560555 -0.5094876
listex1$nums[1,] # the matrix within the list is subsetted as before
## [1] -0.2517120  1.9803582  1.0243503 -0.4516905 -1.2550951

Another way of doing the above is:

listex1[[1]] # note the double square brackets
## [1] "A" "B" "C" "D" "E" "F" "G" "H"
listex1[[2]]
##            [,1]       [,2]       [,3]       [,4]       [,5]
## [1,] -0.2517120  1.9803582  1.0243503 -0.4516905 -1.2550951
## [2,]  0.2051176 -0.3214994 -0.5915234  0.6890398 -0.4921955
## [3,]  1.5308028 -1.1330523 -0.1604744  0.2013822  0.9822019
## [4,]  0.4186588 -1.0168480  0.4245984 -0.2813000 -0.4502648
## [5,]  0.8680424  1.0805173  0.4579380  0.3286511 -2.5338336
## [6,] -0.3212571 -0.5240946 -1.5649648 -1.5379132 -0.8466138
## [7,]  0.4246526 -0.4979848 -1.2103770 -0.6464327 -1.5322684
## [8,]  2.3768041 -0.5477924 -1.5343441 -0.5560555 -0.5094876
listex1[[2]][1,]
## [1] -0.2517120  1.9803582  1.0243503 -0.4516905 -1.2550951

Lists are ok, but they can become like the wild-west where things are thrown in with little organisation. They are fine for small things, but big data shouldn’t be stored in them, or become the basis for a large project. You will see why later.

Data Frames

Data frames can be thought of as a list where each column is of the same length. It allows you to mix different classes (characters, numerics) in the same container. For example, the iris data which comes with R:

data(iris)
iris[sample(1:nrow(iris), 10),] # show 10 random rows.
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 58           4.9         2.4          3.3         1.0 versicolor
## 3            4.7         3.2          1.3         0.2     setosa
## 106          7.6         3.0          6.6         2.1  virginica
## 61           5.0         2.0          3.5         1.0 versicolor
## 40           5.1         3.4          1.5         0.2     setosa
## 129          6.4         2.8          5.6         2.1  virginica
## 63           6.0         2.2          4.0         1.0 versicolor
## 27           5.0         3.4          1.6         0.4     setosa
## 111          6.5         3.2          5.1         2.0  virginica
## 142          6.9         3.1          5.1         2.3  virginica

You see the first few column are numeric, and the last character strings. Data frames are the primary input into things like tidyverse etc, and can be addressed by their names just as you would in a list:

iris$Sepal.Length
##   [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
##  [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
##  [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
##  [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
##  [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
##  [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
## [109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
## [127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
## [145] 6.7 6.7 6.3 6.5 6.2 5.9
plot(iris$Sepal.Length, iris$Sepal.Width)

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + geom_point()

We can also break up a data frame up on attributes and store the contents in a list (which we have already discussed). For example by species:

iris.sp <- split(iris,iris$Species)
names(iris.sp)
## [1] "setosa"     "versicolor" "virginica"

Reading and writing files.

You have to get the data into R first before you can analyse it (this helps a lot). R has many useful functions to do this, so now we can take our first look at some expression data. Download this file (https://github.com/shambam/R_programming_1/blob/main/Mouse_HSPC_reduced.txt) and save it to your current working directory.

Exercise: Open the file in Excel or something to see how it looks, and then call help(read.delim) in your console. Try to work out how the file you are looking at could be read into R using this function.

This is how I would do it:

hspc.data <- read.delim("Mouse_HSPC_reduced.txt", header = T, row.name = 1, sep = "\t")

We can now look at a few aspects of the data:

colnames(hspc.data) # view the column names
##   [1] "LTHSC.1"  "LTHSC.2"  "LTHSC.3"  "LTHSC.4"  "LTHSC.5"  "LTHSC.6" 
##   [7] "LTHSC.7"  "LTHSC.8"  "LTHSC.9"  "LTHSC.10" "LTHSC.11" "LTHSC.12"
##  [13] "LTHSC.13" "LTHSC.14" "LTHSC.15" "LTHSC.16" "LTHSC.17" "LTHSC.18"
##  [19] "LTHSC.19" "LTHSC.20" "LTHSC.21" "LTHSC.22" "LTHSC.23" "LTHSC.24"
##  [25] "LTHSC.25" "LTHSC.26" "LTHSC.27" "LTHSC.28" "LTHSC.29" "LTHSC.30"
##  [31] "LTHSC.31" "LTHSC.32" "LTHSC.33" "LTHSC.34" "LTHSC.35" "LTHSC.36"
##  [37] "LTHSC.37" "LTHSC.38" "LTHSC.39" "LTHSC.40" "LTHSC.41" "LTHSC.42"
##  [43] "LTHSC.43" "LTHSC.44" "LTHSC.45" "LTHSC.46" "LTHSC.47" "LTHSC.48"
##  [49] "LTHSC.49" "LTHSC.50" "MEP.1"    "MEP.2"    "MEP.3"    "MEP.4"   
##  [55] "MEP.5"    "MEP.6"    "MEP.7"    "MEP.8"    "MEP.9"    "MEP.10"  
##  [61] "MEP.11"   "MEP.12"   "MEP.13"   "MEP.14"   "MEP.15"   "MEP.16"  
##  [67] "MEP.17"   "MEP.18"   "MEP.19"   "MEP.20"   "MEP.21"   "MEP.22"  
##  [73] "MEP.23"   "MEP.24"   "MEP.25"   "MEP.26"   "MEP.27"   "MEP.28"  
##  [79] "MEP.29"   "MEP.30"   "MEP.31"   "MEP.32"   "MEP.33"   "MEP.34"  
##  [85] "MEP.35"   "MEP.36"   "MEP.37"   "MEP.38"   "MEP.39"   "MEP.40"  
##  [91] "MEP.41"   "MEP.42"   "MEP.43"   "MEP.44"   "MEP.45"   "MEP.46"  
##  [97] "MEP.47"   "MEP.48"   "MEP.49"   "MEP.50"   "GMP.1"    "GMP.2"   
## [103] "GMP.3"    "GMP.4"    "GMP.5"    "GMP.6"    "GMP.7"    "GMP.8"   
## [109] "GMP.9"    "GMP.10"   "GMP.11"   "GMP.12"   "GMP.13"   "GMP.14"  
## [115] "GMP.15"   "GMP.16"   "GMP.17"   "GMP.18"   "GMP.19"   "GMP.20"  
## [121] "GMP.21"   "GMP.22"   "GMP.23"   "GMP.24"   "GMP.25"   "GMP.26"  
## [127] "GMP.27"   "GMP.28"   "GMP.29"   "GMP.30"   "GMP.31"   "GMP.32"  
## [133] "GMP.33"   "GMP.34"   "GMP.35"   "GMP.36"   "GMP.37"   "GMP.38"  
## [139] "GMP.39"   "GMP.40"   "GMP.41"   "GMP.42"   "GMP.43"   "GMP.44"  
## [145] "GMP.45"   "GMP.46"   "GMP.47"   "GMP.48"   "GMP.49"   "GMP.50"
nrow(hspc.data) # the number of rows in the dataset
## [1] 4170
ncol(hspc.data) # number of columns
## [1] 150
dim(hspc.data) # number of rows and columns together
## [1] 4170  150
colnames(hspc.data) #output the columns headers
##   [1] "LTHSC.1"  "LTHSC.2"  "LTHSC.3"  "LTHSC.4"  "LTHSC.5"  "LTHSC.6" 
##   [7] "LTHSC.7"  "LTHSC.8"  "LTHSC.9"  "LTHSC.10" "LTHSC.11" "LTHSC.12"
##  [13] "LTHSC.13" "LTHSC.14" "LTHSC.15" "LTHSC.16" "LTHSC.17" "LTHSC.18"
##  [19] "LTHSC.19" "LTHSC.20" "LTHSC.21" "LTHSC.22" "LTHSC.23" "LTHSC.24"
##  [25] "LTHSC.25" "LTHSC.26" "LTHSC.27" "LTHSC.28" "LTHSC.29" "LTHSC.30"
##  [31] "LTHSC.31" "LTHSC.32" "LTHSC.33" "LTHSC.34" "LTHSC.35" "LTHSC.36"
##  [37] "LTHSC.37" "LTHSC.38" "LTHSC.39" "LTHSC.40" "LTHSC.41" "LTHSC.42"
##  [43] "LTHSC.43" "LTHSC.44" "LTHSC.45" "LTHSC.46" "LTHSC.47" "LTHSC.48"
##  [49] "LTHSC.49" "LTHSC.50" "MEP.1"    "MEP.2"    "MEP.3"    "MEP.4"   
##  [55] "MEP.5"    "MEP.6"    "MEP.7"    "MEP.8"    "MEP.9"    "MEP.10"  
##  [61] "MEP.11"   "MEP.12"   "MEP.13"   "MEP.14"   "MEP.15"   "MEP.16"  
##  [67] "MEP.17"   "MEP.18"   "MEP.19"   "MEP.20"   "MEP.21"   "MEP.22"  
##  [73] "MEP.23"   "MEP.24"   "MEP.25"   "MEP.26"   "MEP.27"   "MEP.28"  
##  [79] "MEP.29"   "MEP.30"   "MEP.31"   "MEP.32"   "MEP.33"   "MEP.34"  
##  [85] "MEP.35"   "MEP.36"   "MEP.37"   "MEP.38"   "MEP.39"   "MEP.40"  
##  [91] "MEP.41"   "MEP.42"   "MEP.43"   "MEP.44"   "MEP.45"   "MEP.46"  
##  [97] "MEP.47"   "MEP.48"   "MEP.49"   "MEP.50"   "GMP.1"    "GMP.2"   
## [103] "GMP.3"    "GMP.4"    "GMP.5"    "GMP.6"    "GMP.7"    "GMP.8"   
## [109] "GMP.9"    "GMP.10"   "GMP.11"   "GMP.12"   "GMP.13"   "GMP.14"  
## [115] "GMP.15"   "GMP.16"   "GMP.17"   "GMP.18"   "GMP.19"   "GMP.20"  
## [121] "GMP.21"   "GMP.22"   "GMP.23"   "GMP.24"   "GMP.25"   "GMP.26"  
## [127] "GMP.27"   "GMP.28"   "GMP.29"   "GMP.30"   "GMP.31"   "GMP.32"  
## [133] "GMP.33"   "GMP.34"   "GMP.35"   "GMP.36"   "GMP.37"   "GMP.38"  
## [139] "GMP.39"   "GMP.40"   "GMP.41"   "GMP.42"   "GMP.43"   "GMP.44"  
## [145] "GMP.45"   "GMP.46"   "GMP.47"   "GMP.48"   "GMP.49"   "GMP.50"

Exercise: Using subsetting we learnt about earlier, split this data matrix into three parts called lthsc, mep and gmp to separate the cell types shown in the headings. For this look at the help page for a function called grep.

lthsc <- hspc.data[, grep("LTHSC", colnames(hspc.data))]
mep <- hspc.data[, grep("MEP", colnames(hspc.data))]
gmp <- hspc.data[, grep("GMP", colnames(hspc.data))]

To write a table use the write.table function:

write.table(lthsc, "LTHSC_data.txt", row.names = T, col.names = NA, sep = "\t", quote = F)

Exercise: Write out the data for the MEP and GMP data into two files.

The data tables we have now are in the form of a data.frame. Try:

class(mep)
## [1] "data.frame"

This can be an awkward format for some operations so we can convert it to a simple matrix first:

hspc.data <- as.matrix(hspc.data)
lthsc <- as.matrix(lthsc)
mep <- as.matrix(mep)
gmp <- as.matrix(gmp)

Try this now:

class(mep)
## [1] "matrix" "array"

Flow control basics

This is where things get more interesting and we start to feel like proper programmers. Now that we have these datasets loaded in R, we can use them to learn about flow control and some basic mathematical functions. We are going to do a few things the “long way” so you get the idea of how flow control works, and then we’ll look at some shortcuts.

Flow control is how multi-step processes are carried out. In the example below we print out the numbers 1 to 10:

for(i in 1:10){
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

To translate this code, it simply says for every integer from 1 to 10, print this value to the screen.

Exercises: - Using the example above, print the first 10 lines of lthcs in a for loop.

  • Print every 2nd line of mep from lines 1 to 50.

An important point regarding for loops is that any processes/calculations occurring within the loop will stay in the loop. If data generated within a loop has to be retained, we need to create a container to “fill up” while the loop is being carried out.

vec <- NULL
for(i in 1:10){
  vec <- c(vec, i*10)
}

The container vec is initialised outside the loop, and then populated by concatenating to it after every iteration of the loop.

Exercise: Initialise an empty container, and for gmp, calculate the mean of each row (gene), and store the results in the container you made.

gmp.row.mean <- NULL
for(i in 1:nrow(gmp)){
  gmp.row.mean <- c(gmp.row.mean, mean(gmp[i,]))
}

Functions

Functions are chunks of code that execute several lines of code to perform a task. Once you have a few lines of useful code that you want to apply repeatedly, a function is a nice way to wrap them up so it can be used quickly when needed. lets turn the code you wrote in the previous exercise into a function where we also calculate the variance for a gene.

calc.mean.and.sd <- function(mat){
  
  mn <- NULL
  vr <- NULL
  
  for(i in 1:nrow(mat)){  
      mn <- c(mn, mean(mat[i,]))
      vr <- c(vr, var(mat[i,]))
    
  }
  res <- list(mns = mn, vars = vr)
  res # the last line in a function is what the function will return. You can also be more explicit and say return(res) 
}

You can see a loop is started, and the output from each loop is put into variables mn and vr. These are then put into a list which is returned at the end. By putting this code into a function we can now calculate the means and deviations of any matrix. For example, gmp:

gmp.mn.sd <- calc.mean.and.sd(gmp) 

Functions can also work with built-in conditions:

animal.maths <- function(value1, value2, animal = c("pig", "cow")){
  
  if(animal == "pig"){
    print(value1/value2)
  }
  if(animal == "cow"){
    print(value1*value2)
  }
  
}
animal.maths(5, 5, "pig")
## [1] 1
animal.maths(5, 5, "cow")
## [1] 25

The above can be simplified a bit further since there are only two options pig and cow by using the else statement:

animal.maths <- function(value1,value2,animal=c("pig", "cow")){
  
  if(animal == "pig"){
    print(value1/value2)
  }else{
    print(value1*value2)
  }
  
}
animal.maths(5, 5, "pig")
## [1] 1
animal.maths(5, 5, "cow")
## [1] 25

In the case above we’re only concerned if one of the arguments is pig, anything else goes:

animal.maths(5, 5, "dog")
## [1] 25

These functions can now be “banked” for use whenever they are needed (probably not animal.maths to be fair). However, you should avoid using for-loops etc altogether since R has some built in functions that are much quicker and tidier. Lets look at that now.

Apply

‘apply’ is a commonly used function in R to speed up matrix calculation. For example, to calculates means of a matrix we can do this:

lthsc.row.mn <- apply(lthsc, 1, mean) # means of rows
lthsc.col.mn <- apply(lthsc, 2, mean) # means of columns

The format for the function is therefore (1) the matrix, (2) the direction in which you would like to apply a function, and (3) the function to be applied.

Exercise: Use apply to calculate row and column totals and deviations for a yeast dataset of your choosing.

Your own functions can also be used with apply when used as the 3rd argument. Example:

example.func <- function(v){
  
  val <- (mean(v)*sd(v))/sum(v) ## This is a nonsense operation.
  val
}

ex.apply <- apply(mep, 1, example.func)

Lets use the apply function to get the top 500 most variable genes in our HSPC dataset:

gene.vars <- apply(hspc.data, 1, var)
top.var.genes <- names(rev(sort(gene.vars))[1:500])
hspc.var <- hspc.data[top.var.genes,]

Standardising data

Lets stick with our expression data and cluster it. Doing so, we’ll learn more of the language, and some of the fundamental maths that runs underneath. Lets take a look at the range of the data, i.e. getting the lowest and highest value in the matrix of variable genes we just made.

range(hspc.var)
## [1]  0.0000 15.7097

For some operations (such as making heatmaps) the data needs to be z-score normalised (scaled) first. When we scale data, each row of gene is standarised so that it’s mean = 0 and sd = 1. Specifically for a gene g of the i-th row:

\[Z_i=\frac{g_i-\hat{g}}{\sigma g_i}\]

Which means for every gene (\(g_i\)) we subtract the mean expression of the gene (\(\hat{g)}\)) to each expression value (\(g_i\)) and divide by the standard deviation of the expression of the gene (\(\sigma g_i\)).

Exercise: write a function called zscore which will take a single vector of values and scale them. When you have done this, apply this to the hspc.var matrix to scale all rows and call it hspc.zs.

zscore <- function(v){
  z <- (v-mean(v))/sd(v)
  z
}

hspc.zs <- apply(hspc.var,1,zscore)

Now take a look at the first row of the normalised data. Call nrow on the matrix. Does it look right?

hspc.zs <- t(apply(hspc.var, 1, zscore))
boxplot(hspc.zs, las = 2)

# compare to the original data
boxplot(hspc.var, las = 2)

We can see now the data has been centralised around 0.

Exercise: Do some Googling and see if there is a built in function in R that will calculate zscores for you. Use it to zscore your data and save the output in hspc.zs.v2.

hspc.zs.v2 <- t(apply(hspc.var, 1, scale))
hspc.zs.v2[1,]

Lets make sure the same thing has been returned by comparing the first row:

all.equal(hspc.zs[10,], hspc.zs.v2[10,])
## [1] "names for target but not for current"

Clustering

All the steps before have been leading to this, clustering the data and making heatmaps - a staple method in bioinformatics.

Clustering is one of the most common visualisation techniques for genes expression data. Here we will learn how to do some basic histograms/heatmaps and plotting.

R has many ways to do this, and many packages have been written specifically for expression data. We are not going to use these for now, but concentrate on the basic underlying functions that do the maths. For example, the gplots package uses the hclust function which is provided by R. So we will use hclust for now.

To use hclust we need to provide a distance matrix. This is done using the dist function:

hspc.dst <- dist(hspc.zs)

We then cluster using hclust:

hspc.hc <- hclust(hspc.dst)

Plot the dendrogram:

plot(hspc.hc)

You’ll see form this what we have clustered are the genes. If you want to cluster the cells then you need to transpose the matrix using t():

hspc.dst <- dist(t(hspc.zs)) #transpose the matrix here!
hspc.hc <- hclust(hspc.dst)
plot(hspc.hc)

We can see this is pretty much useless. It is far to compact and doesn’t really tell us anything. What we would like is to make a heatmap where the genes and samples are clustered. To do this we need to retrieve some information created by hclust

Call names to see which information is available in the newly created object:

names(hspc.hc)
## [1] "merge"       "height"      "order"       "labels"      "method"     
## [6] "call"        "dist.method"

What we need here is the component called order. We can get this using the $ assignment.

hspc.hc$order
##   [1]  51   8  80  95  54  85  91  59  76  87  97  52  66  62  63  82  55  64
##  [19]  61  75 100  58  60  94  79  30  70  56  86  84  92 106  78  71  73  74
##  [37]  81  99  68  88  53  69 131 129 107 102 132 145 113 126 109 115 133 105
##  [55] 117 116 141 147 120 104 118 111 101 143 108 114   2 128 148 140 149 134
##  [73] 123 122 150 124 125 119 144 103 112 138 121 130 137 135 139 146  25  27
##  [91]  21   6  31  16  13  15  26  50   7  38 136  33  46  11  37  20  23  34
## [109]  17  14  12  18   1  49  10  29  43  32  35   3   9  28  39  40  42  47
## [127]  44  36  45  41  19  48   5  24 142 110   4  98  22  65 127  90  67  72
## [145]  83  77  57  89  93  96

This is the order the cells appear in form left to right when you plotted the dendrogram of cells just before. We use this to reorder the z-scored matrix:

hspc.cell.clustered <- hspc.zs[, hspc.hc$order]

To make a heatmap of the data call image:

image(hspc.cell.clustered)

Ok, this doesn’t look like it should! The matrix is the wrong way round, the colours aren’t right, and there are no labels. One downside to R is that getting all this done takes time and knowledge of R’s plotting capabilities. Thankfully people have already done this and put the code into convenient functions/packages for people to install and use.

Exercise: Install the pheatmap package.

Load it first:

library(pheatmap)

We can now use the pheatmap function that the package provides:

pheatmap(hspc.zs)

The pheatmap function uses hclust to cluster the genes and cells and reorders the matrix according to both. Lets output this to a file:

png("HSPC_heatmap.png", height = 4500, width = 1500)
pheatmap(hspc.zs)
dev.off()
## quartz_off_screen 
##                 3

The file is opened, the plot is made, and dev.off() closes and finalises the file (i.e. nothing more can be written to it).

Exercise: call help (pheatmap) and see what options are available. Play with the options to see what they do.

Clustering is pretty pointless if you can’t define groups and get to the gene names. First we need to capture the output from pheatmap as a variable:

hspc.clust <- pheatmap(hspc.zs)

Lets take a look at the contents of hspc.clust:

names(hspc.clust)
## [1] "tree_row" "tree_col" "kmeans"   "gtable"

What we want is the information contained within the hclust object in tree_row. We get this by treating it like a list:

hspc.clust$tree_row
## 
## Call:
## hclust(d = d, method = method)
## 
## Cluster method   : complete 
## Distance         : euclidean 
## Number of objects: 500
hspc.clust$tree_row$order #the order of genes in the heatmap for examples.
##   [1] 164 413 306 314 293 490 387 491 425 476 276 489 434 458 309 360 315 467
##  [19] 324 184 340 372 405 385 469 205 127 354 246 429 249 465 423 485 454 192
##  [37] 368 399 418 400 492 254 449 464 494 123 496 199 313 461 327 272 193 495
##  [55] 367 395 379 497 126 478 353 150 347 444  37  43  63  76  39  56  83 248
##  [73] 380 237 312  65 204  73 394 303 439 374 381 377 180 200 220 427 136 455
##  [91] 487 189 233 432 185 145 167 463 302 247 451 252 337  97 228 155 144 258
## [109]  28 107 152 383 195 142 138 271 161 325 176 277 113 472 398 173 243 244
## [127] 373 250 296 417 488 190 384 234 364 131 460  80 154 263 351 175 159 259
## [145] 334 281 236 441 361 414 401 477 500 121 280 285 257 422 471 301 198 338
## [163] 279 284 346 445 238 355 483 270 298 402 158 262 151 240 171 209 420  98
## [181] 299  57  67 331 147  53  69 124 330 288 349  61 102 129 265 305 153  29
## [199] 187 106  72  99  78  40 108  55 197 235 256 357  49  66 307 160 484 468
## [217]  88 437 282 297 169 239 320 275 177 479 386 216 392 225 462 378 415 179
## [235] 345 352 470  85 409 304 486 370 457 212  62  90 231 323  70 203 318  82
## [253] 382 410  52 130 111 115 213 294 122 103  41  19   4  13  14  27  58  10
## [271] 174  94  46  35   8  24  33 165 132  51  30  31 128  12  34 105 322 369
## [289] 214 393 141  75 223 343 101 391  81 362 140 438 117 396 201 137 311  79
## [307] 435 245 266 283 433 316 371 450 388 499 224  32 229 269  92 221 366 100
## [325] 498 211 426 133 226 242  77 333 143 162 166 264 125 317 109 230 448  16
## [343]  59 157 342  84 206 480 222  64 114 290 219 411 475 170 363 407 202 278
## [361] 253 421 207 344 348  38 186  86  11   9  18 146 319 328 459 112 452 210
## [379]  96 329 188 341 365  68 178 474  48 291 183  45 119  54 134  74   7 215
## [397]   1   5  36 104  20  15   2   3  21  17  23 135 218 332 408 289 292 268
## [415] 404 326 390 191 308  44 358 419 406 456 403 440 397 466 217 267 300  93
## [433] 181  91 116  22  50 412 163  26   6 156  71 232 251 182 442 194  87 208
## [451] 227 339 149 473 273 389 376 120  42  60  25  89 424 443 321 172 428  47
## [469]  95 196 446 335 482 295 493 350 118 430 241 168 139 148 375 431 356 447
## [487] 453 260 286 481 110 255 287 261 416 436 310 336 274 359

Lets say that we want to split the genes in to 5 clusters groups, we can call the cutree function on an hclust object to do this:

gene.clusters <- cutree(hspc.clust$tree_row, k = 5)
gene.clusters[1:20] # shows the results for the first 20 genes.
##         Elane          Ctsg           Mpo          Car1         Ms4a3 
##             1             1             1             2             1 
##           Mpl         Ly6c2          Klf1          Nkg7         Ces2g 
##             3             1             2             1             2 
##          Cst7         Ermap          Aqp1       Gm15915       Clec12a 
##             1             2             2             2             1 
##         Prtn3         Rab44         Plac8         Gata1 F630028O10Rik 
##             1             1             1             2             1
table(gene.clusters)
## gene.clusters
##   1   2   3   4   5 
##  90 296  53  23  38
barplot(table(gene.clusters))

Lets isolate all the genes beloning to cluster 1 using the which command:

which(gene.clusters == 1)
##         Elane          Ctsg           Mpo         Ms4a3         Ly6c2 
##             1             2             3             5             7 
##          Nkg7          Cst7       Clec12a         Prtn3         Rab44 
##             9            11            15            16            17 
##         Plac8 F630028O10Rik        Atp8b4        Tyrobp         Anxa3 
##            18            20            21            23            36 
##         Fcgr3          Ccl9         Alas1          Sell           Emb 
##            38            45            48            54            59 
##          Lyz2           Fes           Hk3      Serpinf1      Tbc1d10c 
##            64            68            74            77            84 
##        Adgrg3         Gstm1         Napsa           Wls         Igsf6 
##            86            92            96           100           104 
##       Unc93b1        Fcer1g      Tmem176a          Cd53          Cd93 
##           109           112           114           119           125 
##       Tsc22d3      BC035044       Emilin1         Anxa1         Spns3 
##           133           134           135           143           146 
##        Gm8995        Sptbn1         Rab32          Gmpr         Myo1g 
##           157           162           166           170           178 
##     Serpinb1a           Hdc         Casp1         Klhl6         Anxa5 
##           183           186           188           202           206 
##        Lgals1        Adssl1         Asah1            Hp        Tcirg1 
##           207           210           211           215           218 
##        Ifi203          Cnn2        Fkbp1a         Cers5       Tmem173 
##           219           221           222           226           230 
##          Ctsc      Mapkapk3          Tcn2         Stap1         Tor3a 
##           242           253           264           269           278 
##      Tmem176b        Ms4a6c          Tfec           Vim        Igfbp4 
##           290           291           317           319           328 
##           Lbp          Ppic          Hexa          Irf1        Csf2rb 
##           329           333           341           342           344 
##        Fcgr2b          Cpa3         Anxa2         Plcb2        Rasal3 
##           348           363           365           366           407 
##        Gpr171          Idh1          Pak1          Lcp1         Parp8 
##           411           421           426           448           452 
##       Pstpip1          Rgs2       Fam189b        Adgrl4         Elmo1 
##           459           474           475           480           498

We can isolate these rows only from our hspc.zs matrix as we did before:

hspc.cluster.1 <- hspc.zs[names(which(gene.clusters == 1)),]

We can now see how these gene behave as a whole using a boxplot:

boxplot(hspc.cluster.1, las = 2)

Efficiency - time is precious.

Something to always remember is that computer resources such as RAM and disk space should be used wisely. One should always aim to create fast and lean code - that is, without compromising readability. Here is a small example where we make a vector from one to 50,000 using a loop:

values <- NULL # make an empty container to catch the values

for(i in 1:50000){
  values <- c(values, i)
}

values[1:10]
##  [1]  1  2  3  4  5  6  7  8  9 10

Lets do this again, but this time measure how long it takes using the system.time function:

values <- NULL # make an empty container to catch the values
system.time(for(i in 1:50000){  values <- c(values, i)})
##    user  system elapsed 
##   1.773   0.012   1.785
length(values) # check length
## [1] 50000

Ok, now the same thing again, but this time we will fill a vector rather that growing it:

values <- vector("numeric",100000) # make an empty container to catch the values
system.time(for(i in 1:100000){  values[i] <- i})
##    user  system elapsed 
##   0.007   0.000   0.007
length(values) #check length
## [1] 100000

Why do you think this was so much faster?

Linear models

Fitting a linear model (or linear regression) is also another fairly common thing to do, and actually forms the basis of some of the more famous packages we use such as DESeq2, limma etc. Let’s use our expression data to demonstrate this.

In our case we have 3 different samples:

colnames(hspc.data)
##   [1] "LTHSC.1"  "LTHSC.2"  "LTHSC.3"  "LTHSC.4"  "LTHSC.5"  "LTHSC.6" 
##   [7] "LTHSC.7"  "LTHSC.8"  "LTHSC.9"  "LTHSC.10" "LTHSC.11" "LTHSC.12"
##  [13] "LTHSC.13" "LTHSC.14" "LTHSC.15" "LTHSC.16" "LTHSC.17" "LTHSC.18"
##  [19] "LTHSC.19" "LTHSC.20" "LTHSC.21" "LTHSC.22" "LTHSC.23" "LTHSC.24"
##  [25] "LTHSC.25" "LTHSC.26" "LTHSC.27" "LTHSC.28" "LTHSC.29" "LTHSC.30"
##  [31] "LTHSC.31" "LTHSC.32" "LTHSC.33" "LTHSC.34" "LTHSC.35" "LTHSC.36"
##  [37] "LTHSC.37" "LTHSC.38" "LTHSC.39" "LTHSC.40" "LTHSC.41" "LTHSC.42"
##  [43] "LTHSC.43" "LTHSC.44" "LTHSC.45" "LTHSC.46" "LTHSC.47" "LTHSC.48"
##  [49] "LTHSC.49" "LTHSC.50" "MEP.1"    "MEP.2"    "MEP.3"    "MEP.4"   
##  [55] "MEP.5"    "MEP.6"    "MEP.7"    "MEP.8"    "MEP.9"    "MEP.10"  
##  [61] "MEP.11"   "MEP.12"   "MEP.13"   "MEP.14"   "MEP.15"   "MEP.16"  
##  [67] "MEP.17"   "MEP.18"   "MEP.19"   "MEP.20"   "MEP.21"   "MEP.22"  
##  [73] "MEP.23"   "MEP.24"   "MEP.25"   "MEP.26"   "MEP.27"   "MEP.28"  
##  [79] "MEP.29"   "MEP.30"   "MEP.31"   "MEP.32"   "MEP.33"   "MEP.34"  
##  [85] "MEP.35"   "MEP.36"   "MEP.37"   "MEP.38"   "MEP.39"   "MEP.40"  
##  [91] "MEP.41"   "MEP.42"   "MEP.43"   "MEP.44"   "MEP.45"   "MEP.46"  
##  [97] "MEP.47"   "MEP.48"   "MEP.49"   "MEP.50"   "GMP.1"    "GMP.2"   
## [103] "GMP.3"    "GMP.4"    "GMP.5"    "GMP.6"    "GMP.7"    "GMP.8"   
## [109] "GMP.9"    "GMP.10"   "GMP.11"   "GMP.12"   "GMP.13"   "GMP.14"  
## [115] "GMP.15"   "GMP.16"   "GMP.17"   "GMP.18"   "GMP.19"   "GMP.20"  
## [121] "GMP.21"   "GMP.22"   "GMP.23"   "GMP.24"   "GMP.25"   "GMP.26"  
## [127] "GMP.27"   "GMP.28"   "GMP.29"   "GMP.30"   "GMP.31"   "GMP.32"  
## [133] "GMP.33"   "GMP.34"   "GMP.35"   "GMP.36"   "GMP.37"   "GMP.38"  
## [139] "GMP.39"   "GMP.40"   "GMP.41"   "GMP.42"   "GMP.43"   "GMP.44"  
## [145] "GMP.45"   "GMP.46"   "GMP.47"   "GMP.48"   "GMP.49"   "GMP.50"

We need to make a vector of factors from this info to specify the different groups.

Exercise: Install the stringr package and call it in your session.

samples <- colnames(hspc.data)
groups <- gsub("\\..*", "", samples) # This will make a new vector of names but remove the ".X" leaving just what type each sample belongs to
groups
##   [1] "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC"
##  [10] "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC"
##  [19] "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC"
##  [28] "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC"
##  [37] "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC"
##  [46] "LTHSC" "LTHSC" "LTHSC" "LTHSC" "LTHSC" "MEP"   "MEP"   "MEP"   "MEP"  
##  [55] "MEP"   "MEP"   "MEP"   "MEP"   "MEP"   "MEP"   "MEP"   "MEP"   "MEP"  
##  [64] "MEP"   "MEP"   "MEP"   "MEP"   "MEP"   "MEP"   "MEP"   "MEP"   "MEP"  
##  [73] "MEP"   "MEP"   "MEP"   "MEP"   "MEP"   "MEP"   "MEP"   "MEP"   "MEP"  
##  [82] "MEP"   "MEP"   "MEP"   "MEP"   "MEP"   "MEP"   "MEP"   "MEP"   "MEP"  
##  [91] "MEP"   "MEP"   "MEP"   "MEP"   "MEP"   "MEP"   "MEP"   "MEP"   "MEP"  
## [100] "MEP"   "GMP"   "GMP"   "GMP"   "GMP"   "GMP"   "GMP"   "GMP"   "GMP"  
## [109] "GMP"   "GMP"   "GMP"   "GMP"   "GMP"   "GMP"   "GMP"   "GMP"   "GMP"  
## [118] "GMP"   "GMP"   "GMP"   "GMP"   "GMP"   "GMP"   "GMP"   "GMP"   "GMP"  
## [127] "GMP"   "GMP"   "GMP"   "GMP"   "GMP"   "GMP"   "GMP"   "GMP"   "GMP"  
## [136] "GMP"   "GMP"   "GMP"   "GMP"   "GMP"   "GMP"   "GMP"   "GMP"   "GMP"  
## [145] "GMP"   "GMP"   "GMP"   "GMP"   "GMP"   "GMP"
lmod <- lm(hspc.data[1,]~ -1+groups)
lmod
## 
## Call:
## lm(formula = hspc.data[1, ] ~ -1 + groups)
## 
## Coefficients:
##   groupsGMP  groupsLTHSC    groupsMEP  
##       2.798        2.379        3.402

Lets call anova on this:

ano <- anova(lmod)
ano
## Analysis of Variance Table
## 
## Response: hspc.data[1, ]
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## groups      3 1252.9  417.62  47.399 < 2.2e-16 ***
## Residuals 147 1295.2    8.81                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As you can see, the groups factor has a significant value meaning it seems this gene is differentially expressed.

Exercise: Take a look at the ano object and see how to isolate just the p-value from all this info.

ano$`Pr(>F)`[1]

Exercise: Make a function called do.anova which takes a vector vand a vector groups, performs linear modelling and ANOVA and returns just the p-value.

do.anova <- function(v,groups){
  
  mod <- lm(v~ -1+groups)
  p <- anova(mod)$`Pr(>F)`[1]
  p
}

We can now apply this to all of our data:

ps <- apply(hspc.data, MARGIN = 1, FUN = function(x) do.anova(x, groups) )
p.adj <- p.adjust(ps, "hochberg")

Get the top 500 genes:

sig.genes <- names(sort(p.adj))[1:500]

Heatmap them!

library(RColorBrewer)
breaksList = seq(-3, 3, by = 0.2)
hspc.zs.all <- t(apply(hspc.data, 1, zscore))

pheatmap(hspc.zs.all[sig.genes,],breaks = breaksList,color = colorRampPalette(rev(brewer.pal(n = 7, name = "RdYlBu")))(length(breaksList)))

Kmeans

We’re going to program our first algorithm together! Kmeans clustering is a common way to cluster expression data, and the algorithm is actually pretty simple, and a great way to learn how to think programmatically. The data we’ll use comes from yeast where the expression of 256 genes have been measured as a synchronised population go through two divisions. It’s a nice, small dataset with clear clusters.

There is a file called “Spellman_Yeast_Cell_Cycle.tsv” (https://github.com/shambam/R_programming_1/blob/main/Spellman_Yeast_Cell_Cycle.tsv). Load it into a variable called ycc and convert it to a matrix.

ycc <- read.delim("Spellman_Yeast_Cell_Cycle.tsv", row.names = 1, header = T, sep = "\t")
ycc <- as.matrix(ycc)

The data has already been z-scored, so we need to learn how to calculate the Euclidean distance between 2 genes (\(g^{i}\) and \(g^{j}\)) (vector). We do this using the formula:

\[d_{ij}=\sqrt{\sum{(g^{i}_t-g^{j}_t)^2}}\]

Write a function called e.dist which takes two vectors v1 and v2 and calculates the distance between them. In our data, these two vectors (v1 and v2) are the different expression values of two genes through the different time points measured in the yeast cell cycle .

e.dist <- function(v1, v2){
  d <- sqrt(sum((v1-v2)^2))
  d
}

The kmeans algorithm needs the user to say how many clusters are being searched for, so in this case we’ll say 8.

  1. Define 8 centroids randomly, i.e. pick 8 genes at random.
  2. Calculate the distance between each gene to each of the 8 centroids.
  3. Assign each gene to its closest centroid.
  4. For each of the 8 clusters, calculate a new centroid.
  5. Repeat 100 times steps 2, 3, and 4.

We’ll now work through this problem together and crowd source a solution.

centroids <- ycc[sample(1:256, 8, replace=F),] # make a vector of random clusters. We are sampling 8 numbers (since we want 8 centroids) out of the vector 1:256 which represents the genes in ycc. Check that we indeed have 256 genes in ycc!

dists <- matrix(0, nrow = 256, ncol = 8) # make an empty matrix to fill with distances.

clusters <- NULL # make an empty variable to catch the clusters in the loop below
for(iteration in 1:100){ # 100 iterations

  for(gene in 1:256){
  
    for(k in 1:8){
      dists[gene, k] <- e.dist(ycc[gene,], centroids[k,]) # for each gene calculate the distance to each centroid (k).
    }
  }

  clusters <- apply(dists, 1, which.min) # assign a  cluster according to which centroid is nearest

  for(k in 1:8){
    centroids[k,] <- apply(ycc[which(clusters == k),], 2, mean) # define new centroids.
  }
}

Exercise Plot your clusters using base R or ggplots. Dealers choice.

ggplots:

library(reshape2)
library(ggplot2)
ycc.cl <- cbind(as.factor(clusters), ycc)

colnames(ycc.cl)[1] <- c("Cluster")

ycc.df <- data.frame(ID = rownames(ycc.cl), ycc.cl, row.names = NULL)

ytm <- melt(ycc.df, c("ID", "Cluster"))

ggplot(ytm, aes(x = variable, y = value, group = ID)) + geom_line() + facet_wrap(~Cluster, ncol = 3)

base R:

par(mfrow=c(2, 4))

for(k in 1:8){
  
  ycc.c <- ycc[which(clusters == k),]
  plot(ycc.c[1,],ty="l",ylim = range(ycc.c))
  apply(ycc.c, 1, lines)
  
}

Improvements

  1. The code above will do 100 iterations, but what if the algorithm converges after 20? Alter the code so it stops when no further changes are made and help save the planet.
  2. We have a bunch of “magic numbers”. Magic numbers are hard-coded numbers that might raise questions to our code readers (e.g. “where did these came from?”, “why 256 and not 255?”, “what is this 8?”). Improve this by writing a function that will take any data, any number of clusters K, and user-defined number of iterations as arguments. Return a list containing the cluster assignments and data using Kmeans clustering.
  3. Write a function that will plot from the list, a cluster of the user’s choice.

Don’t forget to document your functions nicely so that there is no room for confusions.